Overview
We think the analogy to using R is clear:
- If you are anxious, stressed or avoidant you will be distracted
- Getting confident with the basics makes more complex techniques possible
TODO: replace with feelgood video
In this session we cover:
- Loading data from files
- Using simple techniques to answer research questions with data
- Saving intermediate steps using variables
Principles/ideas
- Using data to answer questions
- Precision and literal-mindedness of R
- Paths and directories
R techniques covered
Storing data in variables
TODO: replace with video
Video summary:
- In R, a variable is the name for a container which stores data.
- We make variables using the assignment operator, which looks like this: <-.
- Values on the right hand side of <- are stored in the variable on the left hand side.
- Variables that you create are stored in the Global Environment, which you can see using the Environment pane.
# calculate 40 + 2 and assign the result to a variable
meaning_of_life <- 40 + 2
# print variable
meaning_of_life
[1] 42
As we work, it’s useful to be able to save the results of the code we write.
As one example, we might have a dataset with multiple columns, each holding participants’ answers to an individual questionnaire item. We might want to calculate a new column, maybe an average of each person’s scores on all of the questions, and keep track of this so we can use it in later calculations.
Alternatively, we might want to save the result of a specific calculation and use it later on.
To do this we can create a variable.
A variable is just a container to store data in. To make variables we use the assignment operator, which looks like this <-
That is, like an arrow that points to the left. This is a reminder that the results of the calculation on the right hand side will be assigned (stored) in the variable on the left hand side.
The code in this chunk runs the calculation on the right hand side of the assignment operator, 40 + 2, and assigns the result to a new variable named meaning_of_life. The output of the chunk is 42, the value of meaning_of_life.
Give your variables short names which describe the data they contain. Use the underscore _ if you need to use more than one word e.g. meaning_of_life.
You might wonder where these variables get saved. In most cases, variables you create are stored in what’s called the Global Environment. You can see them in the Environment pane in RStudio. Double-clicking on any variable there will show you what is stored inside the container.
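As a minimal sketch of reusing a stored result, the chunk below saves one calculation in a variable and then uses that variable in a second calculation. The scores, the assumed maximum score of 5, and the variable names item_scores, average_score and percent_of_max are made up for illustration; they are not part of the workbook.

```r
# hypothetical answers from one participant to three questionnaire items
item_scores <- c(4, 2, 3)

# store the average of the three answers
average_score <- mean(item_scores)

# reuse the stored value: the average as a percentage of the
# maximum possible score (assumed here to be 5)
percent_of_max <- average_score / 5 * 100

# print both variables
average_score
percent_of_max
```

After running this, all three variables should appear in the Environment pane.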
Exercise 1
- Open session-2.rmd using the Files pane. This is the workbook you will be using in this session.
- Run the first chunk in the workbook.
The output should look like this:
Results of creating meaning_of_life variable
Your Environment pane should look like this:
Environment pane after creating variable
Exercise 2
- Create a level 3 markdown heading named “Exercise 2” in your workbook
- Create a new chunk beneath the heading
- Assign the results of the calculation 2 * 35 to the variable seventy
- Run the chunk
Your Environment should now look like this:
Environment pane after creating new variable
Exercise 3
- Create a level 3 markdown heading named “Exercise 3” in your workbook
- Use R to calculate your age in the year 2051.
- Save the result in a variable with a descriptive name.
Passing data to commands using the pipe %>%
TODO: replace with video
Video summary:
- We pass data from one piece of code to another using the pipe command, which looks like this: %>%.
- A pipeline is a sequence of two or more commands joined by %>%.
- You can use the assignment operator to store the results of a pipeline in a variable.
# pipe mtcars into head()
mtcars %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# store first few rows of mtcars
mtcars_head <- mtcars %>%
  head()
Sometimes we need to link together multiple steps in our analysis.
For example, if we’re working with a big dataset we might want to select only some of the columns, then filter out some of the rows of data, and then finally calculate descriptive statistics.
We could do this by creating lots of variables, each one saving the results at each intermediate step. This can get confusing, though.
Instead we can use what’s known as a ‘pipe’, which is another way to link together multiple instructions.
The pipe sends data from one piece of code to another.
The pipe looks like this %>%.
In session 1, you used this command to “pipe” the mtcars dataset into head, which shows just the first few rows:
mtcars %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can think of your data as flowing along lengths of pipe, joined by commands which do things to the data, step by step, until the result you want plops out at the end.
Each pipe should be read as the word “then”, e.g. “take the mtcars data, then head() it”.
The > in the pipe command reminds you of the direction in which your data is flowing (it only works left to right).
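One way to see what the pipe does is to compare it with the ordinary way of calling a command. This sketch assumes the dplyr package (part of the tidyverse) is installed, since it provides %>%; the variable names piped and nested are illustrative only.

```r
library(dplyr)  # provides the pipe, %>%

# pipe mtcars into head(): "take mtcars, then head() it"
piped <- mtcars %>% head()

# the same request written without the pipe
nested <- head(mtcars)

# check whether both approaches give exactly the same data.frame
identical(piped, nested)
```

The pipe does not change what head() does; it only changes where the data is written in the code.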
It’s important to know that the pipe command doesn’t store the results of these steps.
Sometimes that’s OK. In our first example we just wanted to look at the first few rows of the mtcars data.
But, you will usually want to save the result of a pipeline in a new variable.
For example, if we wanted to save the first few rows of the mtcars data to a new variable we would write:
mtcars_head <- mtcars %>% head()
Here we combine assignment with a pipeline.
The result of the pipeline (a data.frame containing the first few rows of mtcars) is saved to a new variable called mtcars_head.
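The same pattern works for longer pipelines. The sketch below follows the kind of steps described earlier (select some columns, then filter rows, then calculate a descriptive statistic) and stores the final result. It assumes dplyr is installed; select(), filter() and summarise() are dplyr commands, and the name four_cyl_summary is chosen here purely for illustration.

```r
library(dplyr)

four_cyl_summary <- mtcars %>%
  select(mpg, cyl) %>%                 # keep only two columns
  filter(cyl == 4) %>%                 # then keep only 4-cylinder cars
  summarise(average_mpg = mean(mpg))   # then calculate the average mpg

# print the stored result
four_cyl_summary
```

Only the final result is stored; none of the intermediate steps needs its own variable.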
You can explore your variables using the Environment pane. A data.frame will have an icon that looks like a spreadsheet. If you [click on the icon], the data.frame is displayed in a new tab in the Source pane.
This tab shows you the same information as printing the data.frame, such as the number of rows and columns, but it also provides tools for exploring the data interactively.
- The arrows next to the column names allow you to arrange the rows in ascending or descending order based on the column values.
- The Filter button allows you to specify a value for one or more columns to filter out non-matching rows. For example, we could display just cars with 4 gears. Click the button again to turn off the filter.
Exercise 4
- Create a level 3 markdown heading named “Exercise 4” in your workbook. (You should be used to doing this for every exercise by now, so we won’t remind you again.)
- Create a new chunk beneath the heading
- Load the tidyverse library
- Pipe the mpg data.frame into head() and assign the results to a variable called mpg_head
- Use the Environment pane to open mpg_head
In 1999, a 6 cylinder, manual transmission Audi A4 could cover ____ miles per gallon when driven in the city.
Loading data from elsewhere
TODO: replace with video
Video summary:
- Often we want to load data into R, rather than use built-in datasets.
- The preferred format for data files in R is comma-separated value (CSV).
- CSV data can be read using the read_csv() command.
- You can load data from an internet address (URL) or a file uploaded to the server.
Loading data
In a lot of these sessions we use datasets that are built-in to R because it’s quick and convenient to illustrate the points we make.
[demo opening glancing some built in data like gapminder, iris, mtcars etc]
Normally, though, you will need to load your own data.
R can read data from two places:
- A URL (web address), if the data file is available on the internet somewhere
- A file on the computer that R is running on
The link below is a URL (web address) for a file containing data about US police shootings.
The final part of the url tells us the name of the file: shootings.csv
The final 3 (sometimes 4) letters of the filename are called the file extension.
Here the file extension is .csv, which stands for ‘comma separated values’ or CSV.
CSV is a common data type. Most data-oriented programmes (e.g. Excel or Open Office or SPSS) can read and write .csv files, so it’s a good choice for storing and sharing data.
If you click on the link [click link in vid] you’ll see the first line is a list of column names separated by commas.
The remaining lines contain rows of data matching the column headings. For example, the value of the arms_category column in row 1 is Guns.
The read_csv() command reads a CSV file, and converts it to a data.frame, which is the format we use in R.
We can use read_csv() to load data from either a file, or over the internet, which is shown in the next video.
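To see how the lines of a CSV file map onto a data.frame, here is a small self-contained sketch. It writes a tiny made-up CSV to a temporary file and reads it back; the file contents and the variable names example_file and people are invented for illustration, and the sketch assumes the readr package (part of the tidyverse) is installed.

```r
library(readr)

# a temporary file to hold the example data
example_file <- tempfile(fileext = ".csv")

# first line: column names; remaining lines: rows of data
writeLines(c("name,age", "Ada,36", "Grace,45"), example_file)

# read the CSV file and store the result in a variable
people <- read_csv(example_file)

# print the data.frame
people
```

The column names come from the first line of the file, and each later line becomes one row.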
Reading CSV files from the internet
TODO: replace with video
Video summary:
- read_csv('http://...') can load data from a URL.
- It converts the data to a data.frame.
- You must assign the loaded data to a variable, which you should give a descriptive name.
- Use the Environment pane to view data you load using read_csv().
# load data from a URL into a variable
shootings <- read_csv('https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv')
# display data
shootings
# A tibble: 4,895 x 15
id name date manner_of_death armed age gender race city state
<dbl> <chr> <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 3 Tim E… 2015-01-02 shot gun 53 M Asian Shel… WA
2 4 Lewis… 2015-01-02 shot gun 47 M White Aloha OR
3 5 John … 2015-01-03 shot and Tasered unar… 23 M Hisp… Wich… KS
4 8 Matth… 2015-01-04 shot toy … 32 M White San … CA
5 9 Micha… 2015-01-04 shot nail… 39 M Hisp… Evans CO
6 11 Kenne… 2015-01-04 shot gun 18 M White Guth… OK
7 13 Kenne… 2015-01-05 shot gun 22 M Hisp… Chan… AZ
8 15 Brock… 2015-01-06 shot gun 35 M White Assa… KS
9 16 Autum… 2015-01-06 shot unar… 34 F White Burl… IA
10 17 Lesli… 2015-01-06 shot toy … 47 M Black Knox… PA
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
# threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>
CSV files are a common format to store and share data. As shown in the previous video, the first line of a CSV file defines the column names, and the remaining lines are rows of data.
The read_csv() command reads a CSV file, and converts it to a data.frame, which is the format we use in R. We can load data either from a file, or over the internet.
In this example, I’m reading a CSV directly over the Internet and storing the resulting data.frame in the variable shootings.
The URL (the link to the CSV file) needs to be in quotes (single or double quotes both work).
shootings <- read_csv('https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv')
Because we made a new variable, the result is stored in the Environment, and we can double-click it to have a look at the data.
An alternative (and recommended) way is to type the name of the variable as a very simple command:
shootings
# A tibble: 4,895 x 15
id name date manner_of_death armed age gender race city state
<dbl> <chr> <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 3 Tim E… 2015-01-02 shot gun 53 M Asian Shel… WA
2 4 Lewis… 2015-01-02 shot gun 47 M White Aloha OR
3 5 John … 2015-01-03 shot and Tasered unar… 23 M Hisp… Wich… KS
4 8 Matth… 2015-01-04 shot toy … 32 M White San … CA
5 9 Micha… 2015-01-04 shot nail… 39 M Hisp… Evans CO
6 11 Kenne… 2015-01-04 shot gun 18 M White Guth… OK
7 13 Kenne… 2015-01-05 shot gun 22 M Hisp… Chan… AZ
8 15 Brock… 2015-01-06 shot gun 35 M White Assa… KS
9 16 Autum… 2015-01-06 shot unar… 34 F White Burl… IA
10 17 Lesli… 2015-01-06 shot toy … 47 M Black Knox… PA
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
# threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>
Exercise 5
- Create a new chunk.
- Read the data stored at https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv
- View it using the Environment pane.
- View it using glimpse().
Using data from your computer
TODO: replace with video
Video summary:
- Before you can use data from your computer, you must upload it to the server.
- Data can be uploaded using the Files pane.
- Always upload data to the same location as your R code.
- For data you upload, give read_csv() the path to the CSV file.
- You must assign the loaded data to a variable, which you should give a descriptive name.
- Use the Environment pane to view the data.
The Upload button in the Files pane lets you upload a file from your computer to RStudio. RStudio uses file extensions to guess what the file contains. A file extension is a sequence of characters, starting with a ., at the end of a file name.
- .csv: CSV file
- .rmd: R Markdown file
Make sure that any file you upload has the correct file extension.
We’ll upload shootings.csv from the previous exercise.
- Click the Upload button.
- Ensure the Target directory is where you want the uploaded file to appear. For this module it should read ~/lifesavr. The ~ (pronounced “tilde”) means your Home directory on the RStudio server. The /lifesavr means the folder named lifesavr in Home.
- Click the Choose file button and select the file you want to upload. After you select a file, its name appears next to the button.
- Click the OK button.
The file should appear in the Files pane in your lifesavr folder.
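Once the file is in the same folder as your workbook, read_csv() can be given just the file name as the path. The sketch below is hedged: it assumes the file is called shootings.csv and sits in R's current working directory, and it checks for the file first so that you get a clear message, rather than an error, if the upload went somewhere else. It also assumes the readr package is installed.

```r
library(readr)

# assumed path: the uploaded file sits next to the workbook
csv_path <- "shootings.csv"

if (file.exists(csv_path)) {
  # the file was found, so load it into a variable
  shootings <- read_csv(csv_path)
} else {
  # show where R was looking, to help track down the file
  message("Cannot find ", csv_path, " in ", getwd())
}
```

If the message appears, compare the folder it reports with the folder shown in the Files pane.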
Exercise 6
- Use your web browser to download https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv to your computer.
- Upload shootings.csv to the server.
- Create a new chunk.
- Read shootings.csv into a variable with a descriptive name.
In which city was the earliest recorded shooting?
Selecting rows with filter()
TODO: replace with video
Video summary:
- The filter() command selects rows from a dataset which match criteria we set.
- The simplest filter uses == (equals equals), to test if the row is an exact match.
- We can use other filters like < or > to match criteria in numeric columns.
- We can combine multiple filters to get exactly the rows we need.
# load gapminder dataset
library(gapminder)
# filter rows where country is equal to the word "Kenya"
# remember to use double equals (==) rather than single (=)
gapminder %>%
filter(country == "Kenya")
# A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 1972 53.6 12044785 1222.
6 Kenya Africa 1977 56.2 14500404 1268.
7 Kenya Africa 1982 58.8 17661452 1348.
8 Kenya Africa 1987 59.3 21198082 1362.
9 Kenya Africa 1992 59.3 25020539 1342.
10 Kenya Africa 1997 54.4 28263827 1360.
11 Kenya Africa 2002 51.0 31386842 1288.
12 Kenya Africa 2007 54.1 35610177 1463.
# select rows where year is greater than 2000
gapminder %>%
filter(year > 2000)
# A tibble: 284 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2002 42.1 25268405 727.
2 Afghanistan Asia 2007 43.8 31889923 975.
3 Albania Europe 2002 75.7 3508512 4604.
4 Albania Europe 2007 76.4 3600523 5937.
5 Algeria Africa 2002 71.0 31287142 5288.
6 Algeria Africa 2007 72.3 33333216 6223.
7 Angola Africa 2002 41.0 10866106 2773.
8 Angola Africa 2007 42.7 12420476 4797.
9 Argentina Americas 2002 74.3 38331121 8798.
10 Argentina Americas 2007 75.3 40301927 12779.
# … with 274 more rows
# select rows with low life expectancy
gapminder %>%
filter(lifeExp < 35)
# A tibble: 33 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Angola Africa 1952 30.0 4232095 3521.
6 Angola Africa 1957 32.0 4561361 3828.
7 Angola Africa 1962 34 4826015 4269.
8 Burkina Faso Africa 1952 32.0 4469979 543.
9 Burkina Faso Africa 1957 34.9 4713416 617.
10 Cambodia Asia 1977 31.2 6978607 525.
# … with 23 more rows
# combine multiple filters
gapminder::gapminder %>%
filter(country=="Kenya") %>%
filter(year > 2000) %>%
filter(lifeExp < 55)
# A tibble: 2 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 2002 51.0 31386842 1288.
2 Kenya Africa 2007 54.1 35610177 1463.
The following chunk filters the gapminder dataset to include only rows where the country column equals “Kenya”.
library(gapminder)
gapminder %>% filter(country == "Kenya")
# A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 1972 53.6 12044785 1222.
6 Kenya Africa 1977 56.2 14500404 1268.
7 Kenya Africa 1982 58.8 17661452 1348.
8 Kenya Africa 1987 59.3 21198082 1362.
9 Kenya Africa 1992 59.3 25020539 1342.
10 Kenya Africa 1997 54.4 28263827 1360.
11 Kenya Africa 2002 51.0 31386842 1288.
12 Kenya Africa 2007 54.1 35610177 1463.
The == is called an “operator”. It compares values from the column on the left hand side with the value specified on the right hand side. The value must match the column type. The value "Kenya" is in quotes because the country column contains text labels (it is stored as a factor); numbers, such as a year, are written without quotes.
The “greater than” operator > filters numeric data.
gapminder %>% filter(year > 2000)
# A tibble: 284 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2002 42.1 25268405 727.
2 Afghanistan Asia 2007 43.8 31889923 975.
3 Albania Europe 2002 75.7 3508512 4604.
4 Albania Europe 2007 76.4 3600523 5937.
5 Algeria Africa 2002 71.0 31287142 5288.
6 Algeria Africa 2007 72.3 33333216 6223.
7 Angola Africa 2002 41.0 10866106 2773.
8 Angola Africa 2007 42.7 12420476 4797.
9 Argentina Americas 2002 74.3 38331121 8798.
10 Argentina Americas 2007 75.3 40301927 12779.
# … with 274 more rows
This chunk filters rows where year is greater than 2000.
The opposite of the > operator is the < operator. This keeps rows where the value in a numeric column is less than the value you specify.
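The rule that the value must match the column type can be sketched with the built-in iris dataset, where Species is a factor (like country in gapminder) and Petal.Length is numeric. This assumes dplyr is installed; the cut-off of 1.5 and the variable names are arbitrary choices for illustration.

```r
library(dplyr)

# Species holds text labels, so the value to match goes in quotes
setosa_flowers <- iris %>%
  filter(Species == "setosa")

# Petal.Length holds numbers, so the value has no quotes
short_petals <- iris %>%
  filter(Petal.Length < 1.5)

# count the rows that matched each filter
nrow(setosa_flowers)
nrow(short_petals)
```

Putting quotes around a number, or leaving them off text, is a common source of errors, because R is literal-minded about types.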
Combined filters
gapminder::gapminder %>%
filter(country=="Kenya") %>%
filter(year > 2000)
# A tibble: 2 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 2002 51.0 31386842 1288.
2 Kenya Africa 2007 54.1 35610177 1463.
Exercise 7
Filter gapminder to show countries with a population greater than 100 million.
The results should look like this:
# A tibble: 77 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Bangladesh Asia 1987 52.8 103764241 752.
2 Bangladesh Asia 1992 56.0 113704579 838.
3 Bangladesh Asia 1997 59.4 123315288 973.
4 Bangladesh Asia 2002 62.0 135656790 1136.
5 Bangladesh Asia 2007 64.1 150448339 1391.
6 Brazil Americas 1972 59.5 100840058 4986.
7 Brazil Americas 1977 61.5 114313951 6660.
8 Brazil Americas 1982 63.3 128962939 7031.
9 Brazil Americas 1987 65.2 142938076 7807.
10 Brazil Americas 1992 67.1 155975974 6950.
# … with 67 more rows
Exercise 8
Show countries with a population greater than 100 million and life expectancy greater than 70.
The results should look like this:
# A tibble: 27 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Brazil Americas 2002 71.0 179914212 8131.
2 Brazil Americas 2007 72.4 190010647 9066.
3 China Asia 1997 70.4 1230075000 2289.
4 China Asia 2002 72.0 1280400000 3119.
5 China Asia 2007 73.0 1318683096 4959.
6 Indonesia Asia 2007 70.6 223547000 3541.
7 Japan Asia 1967 71.4 100825279 9848.
8 Japan Asia 1972 73.4 107188273 14779.
9 Japan Asia 1977 75.4 113872473 16610.
10 Japan Asia 1982 77.1 118454974 19384.
# … with 17 more rows
Sorting data using arrange()
remind them they know how to make scatter and boxplots
- “What is the size of the largest diamond (by carat) in the diamonds dataset?”
- “What cut were the three largest diamonds in that dataset?”
TODO: replace with video
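As a minimal sketch of sorting: arrange() puts rows in ascending order of a column by default, and a minus sign in front of a numeric column reverses the order. dplyr also provides desc() for the same purpose. The example assumes dplyr is installed and uses the built-in mtcars data; the variable names are illustrative.

```r
library(dplyr)

# lowest mpg first (ascending order is the default)
thirstiest_first <- mtcars %>% arrange(mpg)

# highest mpg first, using a minus sign
most_efficient_first <- mtcars %>% arrange(-mpg)

# desc() gives the same descending order as the minus sign
also_descending <- mtcars %>% arrange(desc(mpg))

# show the three most efficient cars
head(most_efficient_first, 3)
```

The minus sign only works for numeric columns, whereas desc() can also be used with text columns.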
diamonds %>% arrange(-carat) %>% head(3)
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
Combining filtering and sorting
TODO: replace with video
In what year did Kenyans have the lowest life expectancy?
gapminder::gapminder %>% filter(country=="Kenya") %>%
arrange(lifeExp) %>%
head(6)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 2002 51.0 31386842 1288.
6 Kenya Africa 1972 53.6 12044785 1222.
In what year was life expectancy highest? All that changes is the minus sign, which reverses the sort order:
gapminder::gapminder %>%
filter(country=="Kenya") %>%
arrange(-lifeExp)
# A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1987 59.3 21198082 1362.
2 Kenya Africa 1992 59.3 25020539 1342.
3 Kenya Africa 1982 58.8 17661452 1348.
4 Kenya Africa 1977 56.2 14500404 1268.
5 Kenya Africa 1997 54.4 28263827 1360.
6 Kenya Africa 2007 54.1 35610177 1463.
7 Kenya Africa 1972 53.6 12044785 1222.
8 Kenya Africa 2002 51.0 31386842 1288.
9 Kenya Africa 1967 50.7 10191512 1057.
10 Kenya Africa 1962 47.9 8678557 897.
11 Kenya Africa 1957 44.7 7454779 944.
12 Kenya Africa 1952 42.3 6464046 854.
Combining rows using summarise()
TODO: replace with video
- Often you have lots of data and need to make summaries of it, e.g. to calculate the average of a column
- The summarise() function takes many rows and uses a function to convert those into fewer rows.
- We can use many different functions with summarise, but common choices are functions for descriptive statistics, like mean, median, or sd (short for standard deviation)
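Several summaries can be calculated in one summarise() call by separating them with commas. This sketch assumes dplyr is installed; the column names average_mpg, median_mpg and sd_mpg are chosen here for illustration.

```r
library(dplyr)

mpg_summary <- mtcars %>%
  summarise(
    average_mpg = mean(mpg),    # mean
    median_mpg  = median(mpg),  # median
    sd_mpg      = sd(mpg)       # standard deviation
  )

# the 32 rows of mtcars have become a single row of summaries
mpg_summary
```

Naming each summary (the part before the =) gives the result readable column names.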
mtcars %>% summarise(average_mpg = mean(mpg))
average_mpg
1 20.09062
Using filter() and summarise() together
- Using the pipe (%>%), we can combine multiple steps
- It’s common to want to filter out certain rows, before using summarise
mtcars %>%
filter(am==1) %>%
summarise(mean(mpg))
mean(mpg)
1 24.39231
Grouping data with group_by
TODO: replace with video
- In our data we may have categorical variables (e.g. gender, or country)
- We often want to compute summaries for each group
- Using filter(), we could make a summary for each group, one by one; the group_by function does this for us
- If you add group_by() to a pipeline then all the subsequent steps are run once for each group
- Be careful only to group by categorical variables
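To illustrate the point about filter() versus group_by(), the sketch below calculates the average mpg for 4-cylinder cars using filter(), and then for every cyl group at once using group_by(). It assumes dplyr is installed; the variable names four_cyl and by_cyl are illustrative.

```r
library(dplyr)

# one group at a time: filter first, then summarise
four_cyl <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(average_mpg = mean(mpg))

# all groups at once: group_by() runs summarise() once per group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(average_mpg = mean(mpg))

# the row for cyl == 4 in by_cyl matches the filtered result
four_cyl
by_cyl
```

With filter() we would need to repeat the pipeline for each value of cyl; group_by() gives one row per group in a single step.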
We might make a plot like this:
mtcars %>%
ggplot(aes(factor(cyl), mpg)) +
geom_boxplot()
But what if we want these numbers in a table (or to include in our report)? We can do that using group_by and summarise…
mtcars %>%
group_by(cyl) %>%
summarise(average_mpg = mean(mpg))
# A tibble: 3 x 2
cyl average_mpg
* <dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
We can also group by two variables at once and get a row for each combination:
mtcars %>% group_by(cyl, am) %>% summarise(mean(mpg))
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am `mean(mpg)`
<dbl> <dbl> <dbl>
1 4 0 22.9
2 4 1 28.1
3 6 0 19.1
4 6 1 20.6
5 8 0 15.0
6 8 1 15.4
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 3.
- What is the %>% symbol called and what does it do?
- What is the <- symbol called and what does it do?
Practice problems
Additional questions
- In the gapminder dataset, what country had the highest life expectancy in 1952? (Use arrange, filter and head)
gapminder::gapminder %>%
filter(year == 1952) %>%
arrange(-lifeExp) %>%
head(1)
# A tibble: 1 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Norway Europe 1952 72.7 3327728 10095.
- What continent had the highest average GDP per capita, averaged over all years in the dataset? (Use arrange, group_by, and summarise)
gapminder::gapminder %>%
group_by(continent) %>%
summarise(average_gdp = mean(gdpPercap)) %>%
arrange(-average_gdp)
# A tibble: 5 x 2
continent average_gdp
<fct> <dbl>
1 Oceania 18622.
2 Europe 14469.
3 Asia 7902.
4 Americas 7136.
5 Africa 2194.
- Make a boxplot showing life expectancy by continent for years after 2000. (Use filter, ggplot and geom_boxplot)
gapminder::gapminder %>%
filter(year > 2000) %>%
ggplot(aes(continent, lifeExp)) +
geom_boxplot()
“Mega problem”
Describe these as the ‘end of level boss characters’. You need to combine all your skills to beat them…
Make a table which shows the average life expectancy for each continent, sorted from highest to lowest:
gapminder::gapminder %>%
group_by(continent) %>%
summarise(life_expectancy = mean(lifeExp)) %>%
arrange(-life_expectancy)
# A tibble: 5 x 2
continent life_expectancy
<fct> <dbl>
1 Oceania 74.3
2 Europe 71.9
3 Americas 64.7
4 Asia 60.1
5 Africa 48.9
Broken script to fix
- Fix a ‘broken’ script: Start a NEW R session and make this code work:
liibrary(todyverse)
# make a density plot of of life expectacy with different color lines for each continent
gapminder::gapminder %>%
ggplote(aes("lifeExp", colr = "Continent")) geom_density()
# select only years after 1990
gapminder::gapminder %>%
filter(year > 1990)
ggplot(aes(year, lifeExp, color=continent)) +
geom_jitter()
NOTE - we will know all the errors they will see so can provide hints for each of them
Correct version would be:
library(tidyverse)
# make a density plot of life expectancy with different color lines for each continent
gapminder::gapminder %>%
ggplot(aes(lifeExp, color = continent)) +
geom_density()
# select only years after 1990
gapminder::gapminder %>%
filter(year > 1990) %>%
ggplot(aes(year, lifeExp, color=continent)) +
geom_jitter()